For help on which statistical test to use, and how to implement it in R (or SAS, Stata, or SPSS), have a look at this link. But let’s start with descriptive statistics.

# on average, how many miles can you drive with 1 gallon in the cars of the mtcars dataset?
mean(mtcars$mpg)
## [1] 20.09062
# what is the maximum number of miles you can drive with 1 gallon?
max(mtcars$mpg)
## [1] 33.9
# and the minimum?
min(mtcars$mpg)
## [1] 10.4
# how many car models have 4, 6, or 8 cylinders?
table(mtcars$cyl)
## 
##  4  6  8 
## 11  7 14

Descriptives (ddply)

With the function ddply() from the plyr package you can easily obtain a lot of descriptive statistics, grouped in a neat and convenient way. For example, we might want to know how many cars with different numbers of cylinders exist in the mtcars dataset, what percentage of cars have 4, 6, or 8 cylinders, and what their minimum, maximum, and average miles-per-gallon values are.

library(plyr)
means <- ddply(mtcars, c("cyl"), summarise, 
               N_cars = length(cyl),
               percentage_cars = round( (100/nrow(mtcars) * N_cars) ),
               mpg_min = min(mpg),
               mpg_max = max(mpg),
               mpg_mean = round( mean(mpg)) )
means
##   cyl N_cars percentage_cars mpg_min mpg_max mpg_mean
## 1   4     11              34    21.4    33.9       27
## 2   6      7              22    17.8    21.4       20
## 3   8     14              44    10.4    19.2       15

Correlation

A correlation measures the strength and the direction of an association between 2 variables.
The coefficient varies between -1 and +1, where ±1 indicates a perfect association and 0 no association at all. The sign gives the direction: positive (+) or negative (-).
 

The most common correlation measures are Pearson’s r and Spearman’s rho.

For the Pearson’s r correlation, both variables should be normally distributed.

Spearman’s rho instead is a non-parametric rank correlation (data is ranked), which does NOT require the data to be normally distributed.  
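Since rho is simply Pearson’s r applied to ranked data, you can verify the equivalence yourself with two mtcars columns (a quick sanity check using base R’s rank()):

```r
# Spearman's rho is just Pearson's r computed on the ranks of the data
cor(mtcars$hp, mtcars$drat, method = "spearman")   # -0.5201
cor(rank(mtcars$hp), rank(mtcars$drat))            # identical value
```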

r and rho are also effect sizes (they should therefore be written in italics).

According to Cohen, Pearson’s r correlation coefficients can be roughly classified like this (the same categorization applies to Spearman’s rho): r around .10 is a small effect, around .30 a medium effect, and .50 or above a large effect.

The square of r (r²) tells you how much variance is shared by the 2 correlated variables.
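For example, squaring the Pearson correlation between mpg and car weight (wt) shows how much variance the two variables share:

```r
r <- cor(mtcars$mpg, mtcars$wt)   # -0.87: a strong negative correlation
r^2                               # ~0.75: mpg and wt share about 75% of their variance
```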

To correlate several variables with each other, use cor() from the stats package.

# here we correlate the last 5 columns of the mtcars dataset with each other
cor(mtcars[,4:8], use="complete.obs", method="spearman")  
##              hp        drat         wt        qsec         vs
## hp    1.0000000 -0.52012499  0.7746767 -0.66660602 -0.7515934
## drat -0.5201250  1.00000000 -0.7503904  0.09186863  0.4474575
## wt    0.7746767 -0.75039041  1.0000000 -0.22540120 -0.5870162
## qsec -0.6666060  0.09186863 -0.2254012  1.00000000  0.7915715
## vs   -0.7515934  0.44745745 -0.5870162  0.79157148  1.0000000
# with "complete.obs" missing values are handled by casewise deletion

 

To see the p value, use cor.test() (but it only works for one pair of variables at a time):

cor.test(mtcars[,4], mtcars[,5], use="complete.obs", method="spearman")  
## 
##  Spearman's rank correlation rho
## 
## data:  mtcars[, 4] and mtcars[, 5]
## S = 8293.8, p-value = 0.002278
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## -0.520125

 

Even better is function rcorr() from the Hmisc package. It gives correlation coefficients and significance levels together.

library(Hmisc)
rcorr(as.matrix(mtcars[,4:8]), type="spearman") # Input must be a matrix!
##         hp  drat    wt  qsec    vs
## hp    1.00 -0.52  0.77 -0.67 -0.75
## drat -0.52  1.00 -0.75  0.09  0.45
## wt    0.77 -0.75  1.00 -0.23 -0.59
## qsec -0.67  0.09 -0.23  1.00  0.79
## vs   -0.75  0.45 -0.59  0.79  1.00
## 
## n= 32 
## 
## 
## P
##      hp     drat   wt     qsec   vs    
## hp          0.0023 0.0000 0.0000 0.0000
## drat 0.0023        0.0000 0.6170 0.0102
## wt   0.0000 0.0000        0.2148 0.0004
## qsec 0.0000 0.6170 0.2148        0.0000
## vs   0.0000 0.0102 0.0004 0.0000

Corrplots and heatmaps

If you want to quickly convey to the reader the strength of correlations between several variables, you should use plots.

We have discussed the scatterplot before. But a fancier, and sometimes more useful, way of showing correlation strengths is a corrplot.

Use the function corrplot() from the package of the same name. The size and color of the circles indicate the strength and direction of the correlation. Check out the many options.

# install.packages("corrplot")
library(corrplot)
mycorrs <- cor(mtcars[,4:8], use="complete.obs", method="spearman")  
corrplot(mycorrs)

Other times, so-called heat maps are useful to quickly spot clusters of associations among many variables.

For a famous example, check out the paper “Matching Categorical Object Representations in Inferior Temporal Cortex of Man and Monkey”, published in Neuron by Kriegeskorte and colleagues (2008).

 

Now let’s do our own (slightly less fancy) heat map with the mtcars dataset

mycorrs <- cor(mtcars[,4:8], use="complete.obs", method="spearman")  
heatmap(mycorrs)

Student’s t-test

The t-test allows us to test whether the mean of a sample differs from a given value, or whether the means of 2 samples differ from each other.

It is important to remember that with a t-test you can only compare a maximum of 2 groups or 2 conditions. Use ANOVA or linear mixed models if you have more groups and/or conditions.

By default, the function t.test() assumes unequal variances and applies the Welch degrees of freedom (df) modification. To specify equal variances, use the var.equal = TRUE option.

Also by default, a 2-tailed test is performed. Specify alternative = "less" or alternative = "greater" to run a 1-tailed test.

One-sample t-test

  • For testing if the mean of a sample differs significantly from 0 (or any other number).

For example, we could test if the average horse power of all cars in the dataset mtcars differs from 0 (which it does), and from 150 (which it doesn’t).

t.test(mtcars$hp, mu = 0)
## 
##  One Sample t-test
## 
## data:  mtcars$hp
## t = 12.103, df = 31, p-value = 2.794e-13
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  121.9679 171.4071
## sample estimates:
## mean of x 
##  146.6875
t.test(mtcars$hp, mu = 150)
## 
##  One Sample t-test
## 
## data:  mtcars$hp
## t = -0.2733, df = 31, p-value = 0.7864
## alternative hypothesis: true mean is not equal to 150
## 95 percent confidence interval:
##  121.9679 171.4071
## sample estimates:
## mean of x 
##  146.6875

Independent t-test

  • For testing the difference between 2 unrelated samples (i.e. different subjects in two groups or conditions - imagine comparing men with women, or drug 1 with drug 2)

Attention: the specification of the formula differs depending on whether your data is in long or wide format.

In long format, we specify a numeric variable and a binary factor (separated by “~”):

# select only chickens that followed diet 1 or 2 (a t-test can only compare 2 groups):
ChickWeight_temp <- ChickWeight[ChickWeight$Diet == 1  |  ChickWeight$Diet == 2, ]

# average across chickens by time and diet
ChickWeight_temp_long <- aggregate(ChickWeight_temp, by=list(ChickWeight_temp$Time, ChickWeight_temp$Diet), FUN = mean, na.rm = TRUE)

# make a bit nicer
ChickWeight_temp_long <- ChickWeight_temp_long[, c(1:3)]
colnames(ChickWeight_temp_long)[1:2] <- c("Time", "Diet")

# now test if weight differs between diet 1 and 2
t.test(ChickWeight_temp_long$weight ~ as.factor(ChickWeight_temp_long$Diet) )
## 
##  Welch Two Sample t-test
## 
## data:  ChickWeight_temp_long$weight by as.factor(ChickWeight_temp_long$Diet)
## t = -0.74786, df = 20.988, p-value = 0.4628
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -63.98631  30.13885
## sample estimates:
## mean in group 1 mean in group 2 
##        105.6929        122.6167

Assume equal variances (this changes the dfs, and slightly the p value):

t.test(ChickWeight_temp_long$weight ~ as.factor(ChickWeight_temp_long$Diet), var.equal = TRUE )
## 
##  Two Sample t-test
## 
## data:  ChickWeight_temp_long$weight by as.factor(ChickWeight_temp_long$Diet)
## t = -0.74786, df = 22, p-value = 0.4625
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -63.85473  30.00727
## sample estimates:
## mean in group 1 mean in group 2 
##        105.6929        122.6167

Change to 1-tailed t-test (modifies the p value)

t.test(ChickWeight_temp_long$weight ~ as.factor(ChickWeight_temp_long$Diet), var.equal = TRUE, alternative="less")
## 
##  Two Sample t-test
## 
## data:  ChickWeight_temp_long$weight by as.factor(ChickWeight_temp_long$Diet)
## t = -0.74786, df = 22, p-value = 0.2312
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##      -Inf 21.93463
## sample estimates:
## mean in group 1 mean in group 2 
##        105.6929        122.6167
t.test(ChickWeight_temp_long$weight ~ as.factor(ChickWeight_temp_long$Diet), var.equal = TRUE, alternative="greater")
## 
##  Two Sample t-test
## 
## data:  ChickWeight_temp_long$weight by as.factor(ChickWeight_temp_long$Diet)
## t = -0.74786, df = 22, p-value = 0.7688
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  -55.78209       Inf
## sample estimates:
## mean in group 1 mean in group 2 
##        105.6929        122.6167

Now let’s put the same data in wide format, where we compare 2 numeric vectors (separated by “,”):

library(reshape2)
ChickWeight_temp_wide <-  dcast(ChickWeight_temp_long, Time ~ Diet , value.var = "weight")
colnames(ChickWeight_temp_wide)[2:3] <- c("Diet1", "Diet2")
t.test(ChickWeight_temp_wide$Diet1 , ChickWeight_temp_wide$Diet2 )
## 
##  Welch Two Sample t-test
## 
## data:  ChickWeight_temp_wide$Diet1 and ChickWeight_temp_wide$Diet2
## t = -0.74786, df = 20.988, p-value = 0.4628
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -63.98631  30.13885
## sample estimates:
## mean of x mean of y 
##  105.6929  122.6167

It is important to point out that you get the same t value, p value, and dfs whether the data is in long or wide format!

Dependent (paired-sample) t-test

  • For testing difference between two related samples (i.e. same people tested twice)

We could use the same ChickWeight data, but this time compare the same chicks across 2 time points.

IMPORTANT: add the option paired = TRUE

In long format, we specify a numeric variable and a binary factor (separated by “~”):

# select only Time 0 and 2:
ChickWeight_temp <- ChickWeight[ChickWeight$Time <= 2,]
t.test(ChickWeight_temp$weight ~ as.factor(ChickWeight_temp$Time), paired=TRUE )
## 
##  Paired t-test
## 
## data:  ChickWeight_temp$weight by as.factor(ChickWeight_temp$Time)
## t = -15.907, df = 49, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -9.190877 -7.129123
## sample estimates:
## mean of the differences 
##                   -8.16

In wide format, we specify 2 numeric vectors (separated by “,”):

library(reshape2)
ChickWeight_temp <- ChickWeight[ChickWeight$Time <= 2,]
ChickWeight_temp_wide <-  dcast(ChickWeight_temp, Chick + Diet ~ Time, value.var = "weight")
colnames(ChickWeight_temp_wide)[3:4] <- c("time0", "time2")
t.test(ChickWeight_temp_wide$time0 , ChickWeight_temp_wide$time2, paired=TRUE )
## 
##  Paired t-test
## 
## data:  ChickWeight_temp_wide$time0 and ChickWeight_temp_wide$time2
## t = -15.907, df = 49, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -9.190877 -7.129123
## sample estimates:
## mean of the differences 
##                   -8.16

Mann-Whitney U & Wilcoxon signed ranks tests

One of the assumptions of the t-test is that the data are normally distributed. However, even when the data are NOT normally distributed, one can use the t-test as long as the sample size is reasonably large (N > 10 at the very least, better > 100). The t-test is thus quite robust to violations of the normality assumption, as long as the sample is not too small.

Alternatives to the t-test exist that do not require normally distributed data:

The non-parametric Mann-Whitney U test (also called Mann-Whitney-Wilcoxon, Wilcoxon rank-sum, or Wilcoxon-Mann-Whitney test) can be used as alternative to the independent t-test.

It is calculated in 2 different ways, depending on the sample size (< 20 or > 20). In either case, it involves ranking the values of both groups, and it does not assume normally distributed data.
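To see what this ranking means, the W statistic reported by wilcox.test() can be reproduced by hand: pool both groups, rank all values, and sum the ranks of the first group (a sketch with two small made-up vectors):

```r
x <- c(12, 15, 21, 30)        # group 1 (invented numbers)
y <- c(9, 11, 14, 16, 22)     # group 2 (invented numbers)
ranks <- rank(c(x, y))        # rank the pooled values of both groups
n1 <- length(x)
W_by_hand <- sum(ranks[1:n1]) - n1 * (n1 + 1) / 2   # rank sum of group 1, shifted
W_builtin <- unname(wilcox.test(x, y)$statistic)
c(W_by_hand, W_builtin)       # both give the same W
```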

The Wilcoxon signed ranks test can be used as an alternative to the dependent (paired) t-test.

Implementing these non-parametric tests is easy, and the formula is almost identical to that of the t-test.

Here is an example of the Mann-Whitney U test:

# select again only chickens that followed diet 1 or 2 (we can only compare 2 groups with a t-test or Mann-Whitney):
ChickWeight_temp <- ChickWeight[ChickWeight$Diet == 1  |  ChickWeight$Diet == 2, ]
# compare weight between the 2 diet groups:
wilcox.test(ChickWeight_temp$weight ~ as.factor(ChickWeight_temp$Diet))

And here is an example of the Wilcoxon signed ranks test:

# select only Time 0 and 2 (all chickens):
ChickWeight_temp <- ChickWeight[ChickWeight$Time <= 2,]
# put Time in wide format
library(reshape2)
ChickWeight_temp_wide <-  dcast(ChickWeight_temp, Chick + Diet ~ Time, value.var = "weight")
colnames(ChickWeight_temp_wide)[3:4] <- c("time0", "time2")
# perform statistics:
wilcox.test(ChickWeight_temp_wide$time0 , ChickWeight_temp_wide$time2, paired=TRUE )
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  ChickWeight_temp_wide$time0 and ChickWeight_temp_wide$time2
## V = 8, p-value = 1.132e-09
## alternative hypothesis: true location shift is not equal to 0

Linear models (regression)

Simple regression

A simple linear regression is similar to a correlation: You want to know the link between two variables. In a regression, however, you go further than in a correlation, as you want to know how much y depends on (is predicted by) x.

A regression therefore tries to predict changes in the dependent variable (DV, the thing you measured) based on 1 or more independent variables (IVs, predictors), even for a range of the IV that is yet unknown.

For example, the “mtcars” dataset includes cars that have between 52 and 335 horse power. By fitting a regression model with miles per gallon (mpg) as DV, and horse power (hp) as IV, we can try to predict what the mpg would be for a car with 400 hp.
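Such an extrapolation is exactly what predict() does; as a quick preview of the model fitted below:

```r
model1 <- lm(mpg ~ hp, mtcars)
predict(model1, newdata = data.frame(hp = 400))   # ~2.8 mpg
```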

The general formula for a linear regression is lm(DV ~ IV, yourdata), where DV is the dependent variable (what you measured), and IV the independent one (the factor partly explaining or predicting the DV). Get the result using summary(), or anova().

Let’s test if the horsepower of cars predicts their gas consumption (in mpg):

model1 <- lm(mpg ~ hp, mtcars)
summary(model1)
## 
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7121 -2.1122 -0.8854  1.5819  8.2360 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.09886    1.63392  18.421  < 2e-16 ***
## hp          -0.06823    0.01012  -6.742 1.79e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared:  0.6024, Adjusted R-squared:  0.5892 
## F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07

How to read the output:

R-squared (R²) tells us how much of the variance of mpg is explained by the model (in this case 60%). Since a simple regression only includes 1 IV, R² also represents how much variance is explained by hp. If you take the square root of R², you get (the absolute value of) the Pearson correlation coefficient r: sqrt(0.6024) ≈ 0.78. Thus, there is a strong correlation between mpg and hp.
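You can verify this relation directly:

```r
model1 <- lm(mpg ~ hp, mtcars)
sqrt(summary(model1)$r.squared)    # 0.776...
abs(cor(mtcars$mpg, mtcars$hp))    # same value: in simple regression, |r| = sqrt(R^2)
```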

The adjusted R² indicates how well the model generalizes to the population. This value should be close to R², but it is typically a bit smaller. The adjusted R² also corrects for the fact that models with more IVs (see multiple regression) automatically have a greater R².
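Concretely, adjusted R² = 1 - (1 - R²)(n - 1)/(n - k - 1), where n is the number of observations and k the number of IVs; a quick check against summary():

```r
model1 <- lm(mpg ~ hp, mtcars)
r2 <- summary(model1)$r.squared
n  <- nrow(mtcars)   # 32 observations
k  <- 1              # 1 IV (hp)
1 - (1 - r2) * (n - 1) / (n - k - 1)   # 0.5892, identical to the summary() value
summary(model1)$adj.r.squared
```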

The significant F-statistic in the last line tells us that the model predicts the DV well. Since the model only includes 1 IV, the F-statistic in this case also tells us that hp is a good predictor of mpg.

The most important part of the output is the Coefficients table:

  • the estimate in the first line (Intercept), gives the intercept of the model’s line, i.e. how many miles per gallon you could drive if your car had 0 horse power: 30! (our model obviously doesn’t understand much of cars, otherwise it would know that a car with 0 hp does not drive at all)
  • The line starting with “hp” gives the estimate of how much mpg changes for a 1-point change in hp (it is the slope, or gradient, of the line best representing the data). The t value indicates the magnitude of the IV’s effect, and the p value its statistical significance. Notice that the estimate for hp is negative, because there is indeed a negative relation between mpg and hp: the stronger the car, the fewer miles you can drive per gallon of gas. The estimate also tells us that for every additional horse power you lose 0.068 mpg. Thus, a car with 400 hp would get 2.8 mpg (the result of 30.09886 - (0.06823 * 400)).

Let’s verify these things by plotting the scatterplot with the fitted regression line. To have the line extend all the way from 0 to 400 hp, we need to include xlim(0,400) and the fullrange=TRUE option.

library(ggplot2)
ggplot(mtcars, aes(x = hp, y = mpg)) + 
  geom_point() +
  xlim(0, 400) +
  stat_smooth(method = "lm", fullrange=TRUE)

In the summary of the regression model, the 3rd line from the bottom provides the degrees of freedom (df = 30).

The statistics can be reported like this: hp significantly predicted mpg, as confirmed by linear regression (t(30) = -6.742, p < .001, R-squared = .6).

Multiple regression

Same as simple regression, only that you want to predict changes in the DV based on several IVs.

A more complicated model that includes several predictors could help explain more variance, and thus achieve a greater R-squared.

You can compare the fit (R2) of different regression models using the function anova().

For example, do we explain more variance by also including the car’s cylinders?

model1 <- lm(mpg ~ hp, mtcars)
model2 <- lm(mpg ~ hp + cyl, mtcars)
anova(model1, model2)
## Analysis of Variance Table
## 
## Model 1: mpg ~ hp
## Model 2: mpg ~ hp + cyl
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     30 447.67                                  
## 2     29 291.97  1     155.7 15.465 0.0004804 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Yes we do. The comparison of the 2 models shows that model2 fits the data significantly better (p < .001). Accordingly, the adjusted R² of the 2nd model (the more complex one, with 2 predictors) is 0.72, compared to 0.59 for the 1st model.

c("adj R2 mod1: ", round( summary(model1)$adj.r.squared, digits=2) )
## [1] "adj R2 mod1: " "0.59"
c("adj R2 mod2: ", round( summary(model2)$adj.r.squared, digits=2) )
## [1] "adj R2 mod2: " "0.72"

By using multiple instead of simple regression, we were able to explain 13% more variance!

Which of the 2 IVs is more important in mod2? To answer that question, we can look at their estimates, SEs, and t-values:

summary(model2)$coefficients
##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 36.9083305 2.19079864 16.846975 1.620660e-16
## hp          -0.0191217 0.01500073 -1.274718 2.125285e-01
## cyl         -2.2646936 0.57588924 -3.932516 4.803752e-04

Only cyl significantly predicts mpg (as indicated by its p value being < .05). Therefore, the number of cylinders of a car has a bigger effect on gas consumption than its horse power.

Another way to look at this is by using the standardized estimates. For that, we use the function lm.beta() from the package QuantPsyc

library(QuantPsyc)
lm.beta(model2)
##         hp        cyl 
## -0.2175294 -0.6710802

The standardized beta estimates indicate that gas consumption increases (mpg decreases) by 0.67 standard deviations (SDs) when the number of cylinders increases by 1 SD. The same change in hp only leads to a 0.22 SD change in mpg.
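If you prefer not to install QuantPsyc, the same standardized betas fall out of a regression on z-scored variables (a base-R sketch; note that cyl is still numeric at this point):

```r
# z-score the DV and IVs with scale(); the raw coefficients then ARE the betas
model2_std <- lm(scale(mpg) ~ scale(hp) + scale(cyl), mtcars)
round(coef(model2_std)[-1], 3)   # -0.218 (hp) and -0.671 (cyl), as from lm.beta()
```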

Similarly, the 95% confidence interval (CI) of the unstandardized beta estimate for cyl does not cross 0, whereas the CI for hp does. This indicates that, of the 2 predictors, the number of cylinders is the more reliable one (its slope is always negative) and plays the bigger role.

confint(model2)
##                   2.5 %      97.5 %
## (Intercept) 32.42764417 41.38901679
## hp          -0.04980163  0.01155824
## cyl         -3.44251935 -1.08686785

We can also represent the size of each estimate, and the CI around it, using the function plot_model() from the package sjPlot. The biggest effect is sorted to the top.

library(sjPlot)
plot_model(model2)

One could also add even more IVs to the model. However, adding displacement (disp) does not significantly improve the model fit:

model2 <- lm(mpg ~ hp + cyl, mtcars)
model3 <- lm(mpg ~ hp + cyl + disp, mtcars)
anova(model2, model3)
## Analysis of Variance Table
## 
## Model 1: mpg ~ hp + cyl
## Model 2: mpg ~ hp + cyl + disp
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     29 291.98                              
## 2     28 261.37  1    30.605 3.2787 0.08093 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Adding weight (wt) does improve the model significantly:

model2 <- lm(mpg ~ hp + cyl, mtcars)
model4 <- lm(mpg ~ hp + cyl + wt, mtcars)
anova(model2, model4)
## Analysis of Variance Table
## 
## Model 1: mpg ~ hp + cyl
## Model 2: mpg ~ hp + cyl + wt
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     29 291.98                                  
## 2     28 176.62  1    115.35 18.287 0.0001995 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Indeed, the adjusted R² is now 0.83.

But notice how adding weight to the regression turns hp and cyl into non-significant predictors (their p values > .05). This happens because the IVs included in the model are not independent of each other. The estimate (slope) of each IV is computed assuming the other IVs are held constant. Thus, when weight and horse power do not change, the number of cylinders no longer plays a role!

summary(model4)
## 
## Call:
## lm(formula = mpg ~ hp + cyl + wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9290 -1.5598 -0.5311  1.1850  5.8986 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 38.75179    1.78686  21.687  < 2e-16 ***
## hp          -0.01804    0.01188  -1.519 0.140015    
## cyl         -0.94162    0.55092  -1.709 0.098480 .  
## wt          -3.16697    0.74058  -4.276 0.000199 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.512 on 28 degrees of freedom
## Multiple R-squared:  0.8431, Adjusted R-squared:  0.8263 
## F-statistic: 50.17 on 3 and 28 DF,  p-value: 2.184e-11
library(sjPlot)
plot_model(model4)

Main effects and interactions in multiple regressions

Multiple regressions include more than one predictor (or IV). There are several ways in which you can enter these predictors in the model.

First, you may want to only look at the main (or simple) effects of each predictor separately (just separate them by + signs).

Second, you may want to also investigate if the predictors interact with each other.

If you have more than 2 predictors, you can either look at all the main and interaction effects, or specify only certain interactions. The "*" sign specifies main and interaction effects of the 2 IVs on each side of the star. The “:” sign specifies only the interaction of the 2 IVs on each side of the colon.

The table shows you how to write the corresponding formulas.

    Number of effects                                              Formula
a   3 main effects                                                 IV1 + IV2 + IV3
b   3 main effects and 1 2-way interaction                         IV1 * IV2 + IV3
c   alternative writing of model “b”                               IV1 + IV2 + IV1:IV2 + IV3
d   3 main effects and 2 2-way interactions                        IV1 + IV2 + IV3 + IV1:IV2 + IV1:IV3
e   3 main effects, 3 2-way interactions, and 1 3-way interaction  IV1 * IV2 * IV3
f   alternative writing of model “e”                               IV1 + IV2 + IV3 + IV1:IV2 + IV1:IV3 + IV2:IV3 + IV1:IV2:IV3

And these are the models using mpg as DV and hp, cyl, and disp as IVs.

model_a <- lm(mpg ~ hp + cyl + disp, mtcars)
model_b <- lm(mpg ~ hp * cyl + disp, mtcars)
model_c <- lm(mpg ~ hp + cyl + hp : cyl + disp, mtcars)
model_d <- lm(mpg ~ hp + cyl + disp + hp : cyl + hp : disp, mtcars)
model_e <- lm(mpg ~ hp * cyl * disp, mtcars)
model_f <- lm(mpg ~ hp + cyl + disp + hp:cyl + hp:disp + cyl:disp + hp:cyl:disp, mtcars)

Plot regression effects

Several packages exist that allow you to quickly visualize the results of your linear models.

A good and versatile package is sjPlot. Check out this helpful vignette for more details.

Let’s visualize the slope of the effect of hp on mpg:

library(sjPlot)
model2 <- lm(mpg ~ hp + cyl, mtcars)
plot_model(model2, type = "pred", terms = "hp")

Now we plot the same, but separately for the different numbers of cylinders.

library(sjPlot)
model2 <- lm(mpg ~ hp + cyl, mtcars)
plot_model(model2, type = "pred", terms = c("hp", "cyl") )

And here with 3 IVs:

library(sjPlot)
model3 <- lm(mpg ~ hp + cyl + disp, mtcars)
plot_model(model3, type = "pred", terms = c("hp", "disp", "cyl") )

Another commonly used one is the effects package:

library(effects)
model2 <- lm(mpg ~ hp + cyl, mtcars)
plot(predictorEffect("hp", model2))

Also check out function effect_plot() from package jtools

Regression with continuous vs. categorical predictors

In the previous examples of simple and multiple regression, we have used continuous predictors, i.e. columns in the data frame (variables) that contain numbers.

class(mtcars$mpg)
## [1] "numeric"
class(mtcars$cyl)
## [1] "numeric"
class(mtcars$wt)
## [1] "numeric"

Classical continuous variables are things that can take a quasi-infinite number of values (e.g. weight, temperature, or miles per gallon).

We can also use categorical predictors in a regression. These are variables that include 2 or more categories (e.g. male and female, or tall and short, etc).

The number of categories in a categorical variable is small (say, between 2 and 5), but the separation between categorical and continuous variables is not always clear-cut. In fact, as long as the number of distinct values is relatively small, a categorical variable can be used as a continuous predictor, and vice versa.

The variable “cyl” only contains 3 categories, as cars in the mtcars dataset only have either 4, 6, or 8 cylinders. If we convert it to a factor, it will be treated in the model as a categorical predictor.

unique(mtcars$cyl)
## [1] 6 4 8
mtcars$cyl <- as.factor(mtcars$cyl)

Let’s start with a simple regression in which “mpg” gets predicted by “cyl”

model5 <- lm(mpg ~ cyl, mtcars)
summary(model5)
## 
## Call:
## lm(formula = mpg ~ cyl, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2636 -1.8357  0.0286  1.3893  7.2364 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  26.6636     0.9718  27.437  < 2e-16 ***
## cyl6         -6.9208     1.5583  -4.441 0.000119 ***
## cyl8        -11.5636     1.2986  -8.905 8.57e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.223 on 29 degrees of freedom
## Multiple R-squared:  0.7325, Adjusted R-squared:  0.714 
## F-statistic:  39.7 on 2 and 29 DF,  p-value: 4.979e-09

As you can see from the summary, there are now 2 lines for cyl, and they have the names “cyl6” and “cyl8”. This is because lm() has automatically taken the cars with 4 cylinders as the reference level, and compared the cars with 6 and 8 cylinders to the cars with 4 cylinders.

The intercept now describes the average mpg for the cars with 4 cylinders:

mean(mtcars[mtcars$cyl==4,]$mpg)
## [1] 26.66364

The estimate for “cyl6” is -6.9208, which means that cars with 6 cylinders drive on average 6.9 fewer miles per gallon than cars with 4 cylinders. Indeed, the average mpg for cars with 6 cylinders is 19.7, which corresponds to 26.6 - 6.9.

mean(mtcars[mtcars$cyl==6,]$mpg)
## [1] 19.74286
26.66364 - 6.9208
## [1] 19.74284

The estimate for “cyl8” is -11.5636, which means that cars with 8 cylinders drive on average 11.6 fewer miles per gallon than cars with 4 cylinders. Indeed, the average mpg for cars with 8 cylinders is 15.1, which corresponds to 26.6 - 11.6.

mean(mtcars[mtcars$cyl==8,]$mpg)
## [1] 15.1
26.66364 - 11.5636
## [1] 15.10004
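By default, R picks the first factor level (here 4) as the reference. You can change it with relevel(); for example, to compare everything against the 8-cylinder cars instead (done on a copy, so the original data stays untouched):

```r
mt <- mtcars
mt$cyl <- relevel(as.factor(mt$cyl), ref = "8")   # make 8 cylinders the reference level
coef(lm(mpg ~ cyl, mt))   # intercept is now 15.1, the mean mpg of 8-cylinder cars
```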

Linear models like to think in lines. Therefore, the sign and the size of the estimate also tell you the slope of the line connecting the average mpg of cars with 4 cylinders to the average mpg of cars with either 6 or 8 cylinders.

In the 2 plots below, you can see that the blue slope going from 4 to 6 cylinders (on the left) is less steep than the red line going to 8 cylinders (on the right) - never mind that the x-axis also shows 6 on the right (I recoded the cyl numbers for display purposes only).

mtcars$cyl_num <- as.numeric(levels(mtcars$cyl))[mtcars$cyl]  # cyl as numeric
datatemp1 <- mtcars[mtcars$cyl != 8,] # exclude cars with 8 cylinders
datatemp2 <- mtcars[mtcars$cyl != 6,] # exclude cars with 6 cylinders
library(ggplot2)
gg_6 <- ggplot(datatemp1, aes(x = cyl_num, y = mpg)) + 
  geom_point(alpha = 0.3, size = 3) + 
  geom_smooth(data = datatemp1, method='lm',se=FALSE, colour = "blue") +
  scale_x_continuous(name = "cylinders", limits=c(3.5, 6.5), breaks=c(4, 6)) + 
  scale_y_continuous(limits=c(10, 35)) + 
  ggtitle("4 to 6 cyl") +
  theme_bw()
datatemp2$cyl_num <- ifelse(datatemp2$cyl_num == 8, 6, 4) # recode 8 to 6, otherwise the plot is larger
gg_8 <- ggplot(datatemp2, aes(x = cyl_num, y = mpg)) + 
  geom_point(alpha = 0.3, size = 3) + 
  geom_smooth(data = datatemp2, method='lm',se=FALSE, colour = "red") +
  scale_x_continuous(name = "cylinders", limits=c(3.5, 6.5), breaks=c(4, 6)) + 
  scale_y_continuous(limits=c(10, 35)) + 
  ggtitle("4 to 8 cyl") +
  theme_bw()
library(cowplot)
## 
## ********************************************************
## Note: As of version 1.0.0, cowplot does not change the
##   default ggplot2 theme anymore. To recover the previous
##   behavior, execute:
##   theme_set(theme_cowplot())
## ********************************************************
## 
## Attaching package: 'cowplot'
## The following objects are masked from 'package:sjPlot':
## 
##     plot_grid, save_plot
cowplot::plot_grid(gg_6, gg_8) 

Things become more complicated if you add more IVs to the model. For example, we could add horse power as a numeric predictor.

mtcars$cyl <- as.factor(mtcars$cyl)
model6 <- lm(mpg ~ cyl * hp  , mtcars)
summary(model6)
## 
## Call:
## lm(formula = mpg ~ cyl * hp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7600 -1.8152 -0.1971  1.5012  7.1606 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  35.98303    3.88908   9.252 1.04e-09 ***
## cyl6        -15.30917    7.43456  -2.059  0.04962 *  
## cyl8        -17.90295    5.25961  -3.404  0.00216 ** 
## hp           -0.11278    0.04575  -2.465  0.02061 *  
## cyl6:hp       0.10516    0.06848   1.536  0.13672    
## cyl8:hp       0.09853    0.04862   2.026  0.05310 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.028 on 26 degrees of freedom
## Multiple R-squared:  0.7882, Adjusted R-squared:  0.7475 
## F-statistic: 19.35 on 5 and 26 DF,  p-value: 5.019e-08

As in model 5, we first get the coefficients “cyl6” and “cyl8”, which reflect the comparison of cars with 6 and 8 cylinders to those with 4 cylinders. The coefficients have changed slightly, because with hp in the model (and hp not centered) the comparison is now evaluated at hp = 0.

Then follows one line with the coefficient for the predictor “hp”. It is a single line because this variable was entered as numeric.

Finally, we have the coefficients for the interaction of cyl and hp. More specifically, “cyl6:hp” indicates how the relationship (slope) between mpg and hp changes from cars with 4 cylinders to those with 6 cylinders. Similarly, “cyl8:hp” indicates how the slope for hp changes from cars with 4 cylinders to those with 8 cylinders.
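To recover the hp slope within each cylinder group, you can sum the relevant coefficients by hand (a quick sketch based on the model6 coefficients shown above):

```r
# slope of hp for 4-cylinder cars (the reference level)
coef(model6)["hp"]                            # ~ -0.113
# slope of hp for 6-cylinder cars: reference slope + interaction term
coef(model6)["hp"] + coef(model6)["cyl6:hp"]  # ~ -0.008
# slope of hp for 8-cylinder cars
coef(model6)["hp"] + coef(model6)["cyl8:hp"]  # ~ -0.014
```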

Assumptions for linear regression

There are a number of assumptions the linear regression model makes about the data.

Linearity

Plot the model’s residuals (the distances from each data point to the fitted regression line), which ideally should be scattered evenly around the horizontal zero line. If they aren’t, the assumption of linearity might be violated. In that case, try transforming the data, e.g. with a log-transform. Or maybe you missed an important predictor (IV) in the model.

To inspect the residuals, you can plot them against the fitted values:

# rm(mtcars) # delete the mtcars in the environment
model6 <- lm(mpg ~ wt, mtcars)
plot(fitted(model6),residuals(model6))

Or plot the residuals against the actual data:

model6 <- lm(mpg ~ wt, mtcars)
model6_res <- resid(model6)
plot(mtcars$wt, model6_res, ylab = "Residuals", xlab = "Car weight")
abline(0, 0)  # add a horizontal reference line at 0

Absence of collinearity

Collinearity basically means that 2 or more of the IVs are highly correlated with each other. This can be problematic for the model and its interpretation. It also means that you will be dealing with larger standard errors and confidence intervals, resulting in fewer chances to reject H0.

One (approximate) way to check for possible collinearity is to see whether some of the IVs in the model are highly correlated. For that, create a pairwise correlation matrix.

rm(mtcars)
round(cor(mtcars[, c(2:4, 6)]), 2) # correlate 4 IVs of interest
##       cyl disp   hp   wt
## cyl  1.00 0.90 0.83 0.78
## disp 0.90 1.00 0.79 0.89
## hp   0.83 0.79 1.00 0.66
## wt   0.78 0.89 0.66 1.00

Or use the corrplot package for an even nicer visualization:

library(corrplot)
cor1 = cor(mtcars[, c(2:4, 6)])
corrplot.mixed(cor1, lower.col = "black")

Another sign of possible collinearity is a very high R2, when at the same time most of the coefficients are not significant.

To more formally quantify collinearity you can use the variance inflation factor (VIF). For example, the function vif() from the package car will compute the VIF of each IV by regressing it onto the other IVs of the model (the actual formula is 1/(1 − R²) of that IV).

A VIF over 10 typically indicates high multicollinearity, while a VIF below 4 or 5 is usually considered acceptable.
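To make the formula concrete, here is a sketch that reproduces the VIF of wt by hand, regressing it on the other three IVs:

```r
# R-squared from regressing wt on the remaining IVs
r2_wt <- summary(lm(wt ~ cyl + disp + hp, mtcars))$r.squared
1 / (1 - r2_wt)  # should match car::vif() for wt below (~4.85)
```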

Model7 includes the correlated IVs cyl and disp; let’s see how high the VIFs get.

model7 <- lm(mpg ~ cyl + disp + hp + wt, mtcars)
car::vif(model7)
##       cyl      disp        hp        wt 
##  6.737707 10.373286  3.405983  4.848016

It seems like the IV “disp” is particularly problematic. The VIFs are all much lower if we remove it from the model.

model8 <- lm(mpg ~ cyl  + hp + wt, mtcars)
car::vif(model8)
##      cyl       hp       wt 
## 4.757456 3.258481 2.580486

The Farrar–Glauber test and other model-wide diagnostic measures of collinearity can be computed with the omcdiag() function from the mctest package.

library(mctest)
omcdiag(mtcars[, c(2:4, 6)], mtcars$mpg)
## 
## Call:
## omcdiag(x = mtcars[, c(2:4, 6)], y = mtcars$mpg)
## 
## 
## Overall Multicollinearity Diagnostics
## 
##                        MC Results detection
## Determinant |X'X|:         0.0115         0
## Farrar Chi-Square:       123.9875         1
## Red Indicator:             0.8131         1
## Sum of Lambda Inverse:    25.3650         1
## Theil's Method:            0.7094         1
## Condition Number:         26.8753         0
## 
## 1 --> COLLINEARITY is detected by the test 
## 0 --> COLLINEARITY is not detected by the test

Out of 6 tests, collinearity was detected by 4. This suggests that multicollinearity is likely.

The function imcdiag includes the VIF and 6 other multicollinearity diagnostics, for each IV individually:

library(mctest)
imcdiag(mtcars[, c(2:4, 6)], mtcars$mpg)
## 
## Call:
## imcdiag(x = mtcars[, c(2:4, 6)], y = mtcars$mpg)
## 
## 
## All Individual Multicollinearity Diagnostics Result
## 
##          VIF    TOL      Wi       Fi Leamer    CVIF Klein
## cyl   6.7377 0.1484 53.5519  83.1967 0.3853 -0.5667     1
## disp 10.3733 0.0964 87.4840 135.9126 0.3105 -0.8724     1
## hp    3.4060 0.2936 22.4558  34.8867 0.5418 -0.2864     0
## wt    4.8480 0.2063 35.9148  55.7962 0.4542 -0.4077     0
## 
## 1 --> COLLINEARITY is detected by the test 
## 0 --> COLLINEARITY is not detected by the test
## 
## cyl , disp , hp , coefficient(s) are non-significant may be due to multicollinearity
## 
## R-square of y on all x: 0.8486 
## 
## * use method argument to check which regressors may be the reason of collinearity
## ===================================

If you only want the VIF, you can specify so via the method argument:

imcdiag(mtcars[, c(2:4, 6)], mtcars$mpg, method="VIF")
## 
## Call:
## imcdiag(x = mtcars[, c(2:4, 6)], y = mtcars$mpg, method = "VIF")
## 
## 
##  VIF Multicollinearity Diagnostics
## 
##          VIF detection
## cyl   6.7377         0
## disp 10.3733         1
## hp    3.4060         0
## wt    4.8480         0
## 
## Multicollinearity may be due to disp regressors
## 
## 1 --> COLLINEARITY is detected by the test 
## 0 --> COLLINEARITY is not detected by the test
## 
## ===================================

ANOVA (one-way, two-way, within, between, mixed, rep measures)

The Analysis of Variance (ANOVA) is a special case of a linear regression, or linear model, in which the predictor variable(s) are categorical (instead of continuous).

For example, you could have the between-subjects factor “treatment” (meaning each subject receives only one level) with 3 levels: drug A, drug B, and placebo. And/or the within-subjects factor “condition” (meaning every subject goes through all levels) with 3 levels: attention to faces, attention to houses, and a control condition.

One-way ANOVA

That linear regression and ANOVA are closely related becomes evident when you realize you can also look at the results of a linear regression using the anova() function.

Let’s again regress mpg on hp, using the mtcars data.

summary(lm(mpg ~ hp, mtcars))
## 
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7121 -2.1122 -0.8854  1.5819  8.2360 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.09886    1.63392  18.421  < 2e-16 ***
## hp          -0.06823    0.01012  -6.742 1.79e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared:  0.6024, Adjusted R-squared:  0.5892 
## F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07

Notice that the F-statistic is given in the last line. Indeed, you can also look at the results of a linear regression using the anova() function:

anova(lm(mpg ~ hp, mtcars))
## Analysis of Variance Table
## 
## Response: mpg
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## hp         1 678.37  678.37   45.46 1.788e-07 ***
## Residuals 30 447.67   14.92                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Another way to run an ANOVA, and get a results table similar to the one in SPSS, is to use the aov() function.

The first argument is the dependent variable (DV), followed by the tilde symbol (~) and the independent variable(s) (IVs). Finally, you indicate the dataset to be used. You store the analysis in an object and then look at the results using summary().

It is also possible to get the means and the number of subjects per cell using print(model.tables()); this only makes sense with a categorical predictor.

Let’s test again if mpg differs by hp.

model2 <- aov(mpg ~ hp, mtcars)
summary(model2)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## hp           1  678.4   678.4   45.46 1.79e-07 ***
## Residuals   30  447.7    14.9                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

And to use a categorical predictor, let’s see if mpg differs by the number of cylinders (which it does).

# first convert cyl from numeric to factor
mtcars_temp <- mtcars
mtcars_temp$cyl <- as.factor(mtcars_temp$cyl)
# then run the anova
model3 <- aov(mpg ~ cyl, mtcars_temp)
summary(model3)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## cyl          2  824.8   412.4    39.7 4.98e-09 ***
## Residuals   29  301.3    10.4                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A quick way to get the means and number of cars per cylinder-group is to use model.tables():

print(model.tables(model3,"means"),digits=3)
## Tables of means
## Grand mean
##          
## 20.09062 
## 
##  cyl 
##        4    6    8
##     26.7 19.7 15.1
## rep 11.0  7.0 14.0

Do we get the same means with ddply? Yes we do!

ddply(mtcars, c("cyl"), summarise, N_cars = length(cyl), mpg_mean = round(mean(mpg), digits=1))
##   cyl N_cars mpg_mean
## 1   4     11     26.7
## 2   6      7     19.7
## 3   8     14     15.1

To display boxplots of mpg by cylinder-group:

boxplot(mpg ~ cyl, data=mtcars_temp) 

repeated measures ANOVA

See this introduction to repeated measures Analysis of Variance (rmANOVA) by DataCamp.

There are different ways of computing a rmANOVA. For example, in his book Discovering Statistics Using R, Andy Field lists 4 ways.

Of these, the ezANOVA() function from the ez package was (as of November 2018) not available for the latest R version (3.5.1).

I thus recommend using aov(), or a linear mixed model (LMM) with the function lmer() from the package lme4.

Linear Mixed Models (LMMs)

Linear mixed models (LMMs) are also called linear mixed effects models (LMEMs), or hierarchical models.

Why LMMs

Compared to a repeated measures ANOVA, an LMM can handle missing data and unbalanced designs (we will see an example below, when subject 335 loses most of its data points), and it explicitly models subject-level variability through random intercepts and slopes.


how to run LMMs in R

To fit LMMs in R, the data needs to be in long format, that is, with several rows per subject. If you need to convert your data from wide to long format, you can do so with the function melt() from the reshape2 package.
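As a sketch with a hypothetical wide data frame (one row per subject, one RT column per day):

```r
library(reshape2)
# hypothetical wide data: one row per subject, one column per day
wide <- data.frame(Subject = c("s1", "s2"),
                   day1 = c(250, 220),
                   day2 = c(270, 230))
long <- melt(wide, id.vars = "Subject",
             variable.name = "Day", value.name = "RT")
long  # one row per subject-day combination, ready for lmer()
```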

To run LMMs we will use the function lmer() from the package lme4. The formula looks similar to this:

model_1 <- lmer(DV ~ fixedEff1 * fixedEff2 + (1 + randomSlopes | randomIntercepts), data=nameofdatafile)

Let’s explain this step by step:

  • On the very left is the name of an object in which you will save the model and its output (e.g. model_1).
  • inside the function lmer() you first indicate the dependent variable (DV), which is what you measured (e.g. reaction time, height, temperature, physiology, etc.). The DV corresponds to the name of a column in the data frame we are using for the analyses.
  • the independent variables (IVs), on the right of the tilde (~), are the ones supposedly influencing the DV. These are also called explanatory variables, predictors, or fixed effects. The names of the fixed effects also correspond to columns in the data frame used.
  • in parentheses you put the random effects, which can potentially be all IVs (we will explain later what random effects are and how they differ from fixed effects)
  • don’t forget to indicate the data
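As a concrete example, a model for the sleepstudy data (used later in this section), with Days as fixed effect and by-subject random intercepts and slopes, would look like this:

```r
library(lme4)
model_example <- lmer(Reaction ~ Days + (1 + Days | Subject), data = sleepstudy)
```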

Random effects for LMMs

Random effects are an important part of LMMs.

To explain the different types of random effects (intercept, slope), I will first use linear models with the function lm(), and then show you an actual LMM with the function lmer().

Random intercept and slope

Let’s load some data, called “sleepstudy” and available from the lme4 package.

In it, 18 subjects were repeatedly tested on a reaction time (RT) task over 10 consecutive days, during which they were only allowed to sleep 3 hours per night.

# load data (in package lme4)
sleepstudy <- sleepstudy
str(sleepstudy)
## 'data.frame':    180 obs. of  3 variables:
##  $ Reaction: num  250 259 251 321 357 ...
##  $ Days    : num  0 1 2 3 4 5 6 7 8 9 ...
##  $ Subject : Factor w/ 18 levels "308","309","310",..: 1 1 1 1 1 1 1 1 1 1 ...

We could fit a linear regression model, to test if RT changed over days.

Be aware that Days is entered as numeric, and therefore the summary will only show 1 line for the effect of Days. This makes sense, because we indeed expect a linear increase of RT across the 10 Days of testing.

model_lm_base <- lm(Reaction ~ Days, sleepstudy)
summary(model_lm_base)
## 
## Call:
## lm(formula = Reaction ~ Days, data = sleepstudy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -110.848  -27.483    1.546   26.142  139.953 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  251.405      6.610  38.033  < 2e-16 ***
## Days          10.467      1.238   8.454 9.89e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 47.71 on 178 degrees of freedom
## Multiple R-squared:  0.2865, Adjusted R-squared:  0.2825 
## F-statistic: 71.46 on 1 and 178 DF,  p-value: 9.894e-15

Let’s save the predicted values (the fitted values) in the sleepstudy data frame (in a new variable called “model_lm_base”)

sleepstudy$model_lm_base <- fitted(model_lm_base)

And now let’s plot the original data with the superimposed regression line. Each dot represents one subject on a specific day

gg <- ggplot(sleepstudy, aes(x = Days, y = Reaction)) +
    geom_line(aes(y = model_lm_base), color = "darkgrey") +
    geom_point(alpha = 0.3, size = 3) +
    theme_bw()
print(gg)

Looking at the graph, it becomes clear that RTs increased over time, as one would expect since participants also gradually became more tired.

But how good is the fit for each participant? To visualize it, we can draw one scatterplot for each subject.

gg <- ggplot(sleepstudy, aes(x = Days, y = Reaction)) +
    geom_line(aes(y = model_lm_base), color = "darkgrey") +
    geom_point(alpha = 0.3, size = 3) +
    facet_wrap(~Subject) +
    theme_bw()
print(gg)

Plotting the data and regression line separately for each subject makes it clear that our model_lm_base does not provide a very good fit to the data (e.g. see how far the regression line is from the data points for subject 309, 310, 335, or 337).

Intercept by subject

We could therefore add another predictor to the regression, allowing for the regression line to have a different intercept (the point of intersection with the y-axis) for each subject.

Be aware that Subject is a factor (the subject numbers were assigned randomly, and we do not expect RTs to change linearly from Subject 308 to Subject 372). Because of this, the summary provides 1 line for each comparison to the first subject, number 308 (check the reference level with levels(sleepstudy$Subject)).

model_lm_intercept <- lm(Reaction ~ Days + Subject, sleepstudy)
summary(model_lm_intercept)
## 
## Call:
## lm(formula = Reaction ~ Days + Subject, data = sleepstudy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -100.540  -16.389   -0.341   15.215  131.159 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  295.0310    10.4471  28.240  < 2e-16 ***
## Days          10.4673     0.8042  13.015  < 2e-16 ***
## Subject309  -126.9008    13.8597  -9.156 2.35e-16 ***
## Subject310  -111.1326    13.8597  -8.018 2.07e-13 ***
## Subject330   -38.9124    13.8597  -2.808 0.005609 ** 
## Subject331   -32.6978    13.8597  -2.359 0.019514 *  
## Subject332   -34.8318    13.8597  -2.513 0.012949 *  
## Subject333   -25.9755    13.8597  -1.874 0.062718 .  
## Subject334   -46.8318    13.8597  -3.379 0.000913 ***
## Subject335   -92.0638    13.8597  -6.643 4.51e-10 ***
## Subject337    33.5872    13.8597   2.423 0.016486 *  
## Subject349   -66.2994    13.8597  -4.784 3.87e-06 ***
## Subject350   -28.5311    13.8597  -2.059 0.041147 *  
## Subject351   -52.0361    13.8597  -3.754 0.000242 ***
## Subject352    -4.7123    13.8597  -0.340 0.734300    
## Subject369   -36.0992    13.8597  -2.605 0.010059 *  
## Subject370   -50.4321    13.8597  -3.639 0.000369 ***
## Subject371   -47.1498    13.8597  -3.402 0.000844 ***
## Subject372   -24.2477    13.8597  -1.750 0.082108 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.99 on 161 degrees of freedom
## Multiple R-squared:  0.7277, Adjusted R-squared:  0.6973 
## F-statistic: 23.91 on 18 and 161 DF,  p-value: < 2.2e-16

This output is a big mess, as it shows all Subjects in relation to subject 308.

A better way to test if there are general differences between the effects of Days across participants is by using the anova() function.

anova(model_lm_intercept)
## Analysis of Variance Table
## 
## Response: Reaction
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## Days        1 162703  162703 169.401 < 2.2e-16 ***
## Subject    17 250618   14742  15.349 < 2.2e-16 ***
## Residuals 161 154634     960                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Ok, now let’s save the predicted values of the model including Subject, and plot the data with both the old regression line in grey (identical for all subjects) and the new regression lines in red (with a different intercept for each subject).

sleepstudy$model_lm_intercept <- fitted(model_lm_intercept)
gg <- ggplot(sleepstudy, aes(x = Days, y = Reaction, group = Subject)) +
    geom_line(aes(y = model_lm_base), color = "darkgrey") +
    geom_line(aes(y = model_lm_intercept), color = "red") +
    geom_point(alpha = 0.3, size = 3) +
    facet_wrap(~Subject) +
    theme_bw()
print(gg)

R-squared

This model seems to fit better, as it adjusts the “height” of the regression line for each subject.

The formal way to assess the goodness of fit of a linear model is with the R2 statistic, which reflects the proportion of variance explained by the model (the bigger the better).

The R2 is provided at the end of the summary() output, but it can also be extracted by itself with summary(modelname)$r.squared.

You should also know that R2 always increases for models that include more variables, even if they don’t improve the fit at all. Therefore, it is better to look at the adjusted R2, which accounts for the number of variables in the model.
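The adjustment uses the formula 1 − (1 − R²)(n − 1)/(n − p − 1), with n observations and p predictors. A quick sketch verifying this for model_lm_base (n = 180, p = 1):

```r
r2 <- summary(model_lm_base)$r.squared
n  <- nrow(sleepstudy)  # 180 observations
p  <- 1                 # one predictor (Days)
1 - (1 - r2) * (n - 1) / (n - p - 1)  # ~0.282, the adjusted R-squared
```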

summary(model_lm_base)$r.squared; summary(model_lm_base)$adj.r.squared
## [1] 0.2864714
## [1] 0.2824628
summary(model_lm_intercept)$r.squared; summary(model_lm_intercept)$adj.r.squared
## [1] 0.727736
## [1] 0.6972965

Even with the adjusted R2, model_lm_intercept seems to explain much more variance than model_lm_base. But is it significantly more? To statistically compare the fit of 2 models, we use the function anova()

anova(model_lm_base, model_lm_intercept)
## Analysis of Variance Table
## 
## Model 1: Reaction ~ Days
## Model 2: Reaction ~ Days + Subject
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    178 405252                                  
## 2    161 154634 17    250618 15.349 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can now confidently say that model_lm_intercept (Model 2) has a significantly better fit to the data than model_lm_base (Model 1).

Slope by Subject

However, another thing that might vary between subjects is the steepness of the slope. To visualise it, let’s fit a model with varying slopes by subject, while keeping the intercept constant.

model_lm_slope <- lm(Reaction ~ Days : Subject, sleepstudy)
sleepstudy$model_lm_slope <- fitted(model_lm_slope)
gg <- ggplot(sleepstudy, aes(x = Days, y = Reaction, group = Subject)) +
    geom_line(aes(y = model_lm_base), color = "darkgrey") +
    geom_line(aes(y = model_lm_slope), color = "blue") +
    geom_point(alpha = 0.3, size = 3) +
    facet_wrap(~Subject) +
    theme_bw()
print(gg)

How is the fit of this model compared to the others?

summary(model_lm_base)$adj.r.squared;  summary(model_lm_intercept)$adj.r.squared; summary(model_lm_slope)$adj.r.squared
## [1] 0.2824628
## [1] 0.6972965
## [1] 0.7346207
anova(model_lm_intercept, model_lm_slope)
## Analysis of Variance Table
## 
## Model 1: Reaction ~ Days + Subject
## Model 2: Reaction ~ Days:Subject
##   Res.Df    RSS Df Sum of Sq F Pr(>F)
## 1    161 154634                      
## 2    161 135567  0     19067

Although the adjusted R2 is slightly higher for model_lm_slope, the two models use the same number of parameters (note Df = 0 in the anova() output), so no F test can be computed; we cannot say that one fits significantly better than the other.

Intercept and Slope by Subject

The model that fits BEST is probably one that allows both the intercept and the slope to vary by participant.

model_lm_full <- lm(Reaction ~ Days + Subject + Days : Subject, sleepstudy)
sleepstudy$model_lm_full <- fitted(model_lm_full)
gg <- ggplot(sleepstudy, aes(x = Days, y = Reaction, group = Subject)) +
    geom_line(aes(y = model_lm_base), color = "darkgrey") +
    geom_line(aes(y = model_lm_full), color = "green2") +
    geom_point(alpha = 0.3, size = 3) +
    facet_wrap(~Subject) +
    theme_bw()
print(gg)

summary(model_lm_base)$adj.r.squared;  summary(model_lm_intercept)$adj.r.squared; summary(model_lm_slope)$adj.r.squared; summary(model_lm_full)$adj.r.squared
## [1] 0.2824628
## [1] 0.6972965
## [1] 0.7346207
## [1] 0.7935847
anova(model_lm_slope, model_lm_full)
## Analysis of Variance Table
## 
## Model 1: Reaction ~ Days:Subject
## Model 2: Reaction ~ Days + Subject + Days:Subject
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    161 135567                                  
## 2    144  94312 17     41255 3.7053 7.209e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Indeed, the model_lm_full explains 79% of the variance (adjusted R2), and fits significantly better than the 2nd best model_lm_slope.

the same with LMMs

Now let’s do the same but with LMMs instead of linear regressions.

One important difference between linear models and LMMs is that we can’t fit an LMM without random effects (we need at least random intercepts).

model_lmer_intercept <- lmer(Reaction ~ Days + (1|Subject), sleepstudy)
model_lmer_slope <- lmer(Reaction ~ Days + (0 + Days|Subject), sleepstudy)
model_lmer_full <- lmer(Reaction ~ Days + (1 + Days|Subject), sleepstudy)
sleepstudy$model_lmer_intercept <- fitted(model_lmer_intercept)
sleepstudy$model_lmer_slope <- fitted(model_lmer_slope)
sleepstudy$model_lmer_full <- fitted(model_lmer_full)
gg <- ggplot(sleepstudy, aes(x = Days, y = Reaction, group = Subject)) +
    geom_line(aes(y = model_lmer_intercept), color = "red") +
    geom_line(aes(y = model_lmer_slope), color = "blue") +
    geom_line(aes(y = model_lmer_full), color = "green2") +
    geom_point(alpha = 0.3, size = 3) +
    facet_wrap(~Subject) +
    theme_bw()
print(gg)

The model_lmer_full obviously fits better, but is it also significantly better?

marginal and conditional R-squared for LMMs

Unfortunately, it is slightly more complicated to get the R2 and adjusted R2 for LMMs fitted with lmer().

For that, we will need the package MuMIn and its function r.squaredGLMM, which will provide the marginal and conditional R squared.

The marginal R squared (R2m) describes the proportion of variance explained by the fixed factor(s) alone. The conditional R squared (R2c) describes the proportion of variance explained by both the fixed and random factors. We should pay attention to R2c, since the random factors are all that is changing between the models.

r.squaredGLMM(model_lmer_intercept); r.squaredGLMM(model_lmer_slope); r.squaredGLMM(model_lmer_full)
##            R2m       R2c
## [1,] 0.2798856 0.7042555
##            R2m       R2c
## [1,] 0.2794061 0.7411658
##            R2m       R2c
## [1,] 0.2786382 0.7992289
anova(model_lmer_intercept, model_lmer_slope, model_lmer_full)
## refitting model(s) with ML (instead of REML)
## Data: sleepstudy
## Models:
## model_lmer_intercept: Reaction ~ Days + (1 | Subject)
## model_lmer_slope: Reaction ~ Days + (0 + Days | Subject)
## model_lmer_full: Reaction ~ Days + (1 + Days | Subject)
##                      Df    AIC    BIC  logLik deviance  Chisq Chi Df
## model_lmer_intercept  4 1802.1 1814.8 -897.04   1794.1              
## model_lmer_slope      4 1782.1 1794.8 -887.04   1774.1 19.998      0
## model_lmer_full       6 1763.9 1783.1 -875.97   1751.9 22.141      2
##                      Pr(>Chisq)    
## model_lmer_intercept               
## model_lmer_slope      < 2.2e-16 ***
## model_lmer_full       1.557e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The model_lmer_full has the highest R2c of 0.8, and fits significantly better than the 2nd best model, model_lmer_slope.

lm() vs lmer()

One might wonder what the difference is between:

  • a linear regression including the factors Days and Subject, and
  • a linear mixed model (LMM) with the fixed factor Days and random intercepts by Subject.

Well, at first there is none. Indeed, the coefficients for the main effect of Days are identical.

model_lm_intercept <- lm(Reaction ~ Days + Subject, sleepstudy)
model_lmer_intercept <- lmer(Reaction ~ Days + (1|Subject), sleepstudy)
model_lm_intercept$coef[2]
##     Days 
## 10.46729
fixef(model_lmer_intercept)[2]
##     Days 
## 10.46729

And the regression lines on the plots overlap completely

model_lmer_intercept <- lmer(Reaction ~ Days + (1|Subject), sleepstudy)
summary(model_lmer_intercept)
## Linear mixed model fit by REML ['lmerMod']
## Formula: Reaction ~ Days + (1 | Subject)
##    Data: sleepstudy
## 
## REML criterion at convergence: 1786.5
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.2257 -0.5529  0.0109  0.5188  4.2506 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Subject  (Intercept) 1378.2   37.12   
##  Residual              960.5   30.99   
## Number of obs: 180, groups:  Subject, 18
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept) 251.4051     9.7467   25.79
## Days         10.4673     0.8042   13.02
## 
## Correlation of Fixed Effects:
##      (Intr)
## Days -0.371
sleepstudy$model_lmer_intercept <- fitted(model_lmer_intercept)
gg <- ggplot(sleepstudy, aes(x = Days, y = Reaction, group = Subject)) +
    geom_line(aes(y = model_lm_base), color = "darkgrey") +
    geom_line(aes(y = model_lm_intercept), color = "red") +
    geom_line(aes(y = model_lmer_intercept), color = "red", linetype = "dashed") +
    geom_point(alpha = 0.3, size = 3) +
    facet_wrap(~Subject) +
    theme_bw()
print(gg)

However, if we remove days 1 to 8 from subject 335, we will start to see a difference.

Notice how for subject 335, who now only has 2 data points, the red regression line is somewhat “pulled” up in the direction of the regression line of the model_lm_base.

This happens because subject 335 provides less data, so the average regression line is weighted more heavily.

sleepstudy2 <- sleepstudy[-c(82:89),]

model_lm_intercept2 <- lm(Reaction ~ Days + Subject, sleepstudy2)
model_lmer_intercept2 <- lmer(Reaction ~ Days + (1|Subject), sleepstudy2)
sleepstudy2$model_lm_intercept2 <- fitted(model_lm_intercept2)
sleepstudy2$model_lmer_intercept2 <- fitted(model_lmer_intercept2)
gg <- ggplot(sleepstudy2, aes(x = Days, y = Reaction, group = Subject)) +
    geom_line(aes(y = model_lm_base), color = "darkgrey") +
    geom_line(aes(y = model_lm_intercept2), color = "red") +
    geom_line(aes(y = model_lmer_intercept2), color = "red",  linetype = "dashed") +
    geom_point(alpha = 0.3, size = 3) +
    facet_wrap(~Subject) +
    theme_bw()
print(gg)
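One way to quantify this “pulling” (shrinkage) is to compare the predictions of the two models for subject 335 directly (a sketch using the objects fitted above):

```r
# difference between the LMM and plain lm fitted values for subject 335
s335 <- sleepstudy2$Subject == "335"
mean(fitted(model_lmer_intercept2)[s335] - fitted(model_lm_intercept2)[s335])
# a positive value indicates the LMM prediction is pulled up
# toward the average regression line
```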

Summary random effects

Regarding the random effects structure, Barr and colleagues recommend “keeping it maximal”, whereas Bates and colleagues argue for more parsimonious models. In general, you might want to plot the fit of your model to your data, and compare several models to each other. However, complex models with many fixed and random factors may not converge, and so you will have to reduce them.